1 Introduction


This report summarizes the progress on preprocessing of video sequences for data compression
during the first period of grant no. NAG3-1186. The end goal associated with this and
subsequent research on the topic is a compression system for HDTV capable of transmitting 
perceptually lossless sequences at under one bit per pixel. We have concentrated on two 
subtopics designed to prepare the video signal for more efficient coding: 1) nonlinear filter- 
ing to remove noise and shape the signal spectrum to take advantage of insensitivities of 
human viewers, and 2) segmentation of each frame into temporally dynamic/static regions 
for conditional frame replenishment. The latter technique operates best, of course, under 
the assumption that the sequence can be modelled as a superposition of active foreground 
and static background. 

We have restricted our considerations to monochrome data, since we expect to use the 
standard luminance/chrominance decomposition, which concentrates most of the bandwidth 
requirements in the luminance. Similar methods to those discussed here may be applied 
to the two chrominance signals, but because the greatest compression ratio is available by 
attacking the component of highest energy, we postpone investigations of the treatment of 
chrominance to the upcoming coding research. 

The grant furnished financial support for two research assistants. Dr. Qian Wei, a postdoctoral
associate, was responsible primarily for work related to nonlinear filtering[1]. Dr.
Wei will continue his visit with the department until the summer of 1991. Ms. Coleen Jones, a 
Master’s degree candidate in the Department of Electrical Engineering, has nearly completed 
her thesis research under this grant, and is expected to defend her thesis within the next two 
months. Ms. Jones’ work involved literature review on several related topics, development 
of the algorithm for dynamics detection and estimation in Sec. 3, and simulations of its 
performance[2, 3]. Also among her work was the study of the frequency response of the
human visual system (HVS), which is intended to assist the phase of this research dealing
directly with data compression. The HVS work is not included in this report. A copy of Ms. 
Jones’ completed thesis will be sent to NASA Lewis on its approval by the University. 
2 Nonlinear 3-D Prefiltering Algorithms


In the following section, we discuss the development of prefiltering algorithms and their 
applications to the change detection problem. First, a new structure developed particularly 
for this application is introduced. Then the resulting properties are analyzed and their 
relevance to the considered problem is demonstrated. 


2.1 The Filter Structure 


The desired 3-D prefilter must satisfy some basic constraints, such as edge preservation and
zero-phase behavior in space; the filter therefore cannot be spatially causal. These constraints
ensure that the filtering operation does not destroy important image information by blurring and
shifting image details. At the same time, the filter structure should be computationally simple
and should preserve the dynamic range. This is especially important for real-time implementations
of image processing algorithms. Furthermore, the algorithm should show robust
performance in terms of its noise suppression capabilities with respect to various types of
noise. The passband of the filter should match the characteristics of the human visual system
as far as possible, i.e. the usual designs with circular passbands cannot be used.

The requirements stated above eliminate many of the well-known filter design techniques,
and new concepts have to be investigated. Computational efficiency dictates the use of
recursive filters in space and time, preferably low-order realizations. Since recursive filters
cannot be zero phase, the concept of causality inversion in space is employed to force zero-phase
behavior. At the same time, the constraint of edge preservation requires the overall
filter to be nonlinear, in particular some type of rank order filter. The preservation of
the dynamic range can be achieved by choosing the filter to have an aperiodic response.
Considering the above conditions and the fact that the filter should be simple to design 
and its properties should be analyzable[4], the design of a “3-D Hybrid Median Filter” was 
chosen. It consists of two major filtering blocks, one being a one-dimensional recursive linear 
time-invariant “time-filter” of first order, the other being composed of four 2-D recursive 
linear shift-invariant spatial filters of first order. In the spatial block, the outputs of the 
four filters are combined in a nonlinear fashion, so that the total 2-D filter block becomes 
nonlinear. The resulting realization is shown in Fig. 3. 

The transfer function of the 2-D linear shift-invariant block $H(z_1, z_2)$ is given by:

$$H(z_1, z_2) = \frac{(0.5 - a)\,z_1^{-1} + (0.5 - a)\,z_2^{-1}}{1 - a\,z_1^{-1} - a\,z_2^{-1}} \qquad (1)$$

where the variables $z_1$ and $z_2$ are the z-transform variables of the spatial variables $n_1$ and $n_2$.
The parameter $a$ is the only free parameter in the transfer function and has to be chosen
properly for obtaining lowpass and aperiodic characteristics. This will be discussed in more
detail in section 2.2.4.


The 1-D first order time filter is described by:

$$H(z_3) = \frac{1 - a_t}{z_3 - a_t} \qquad (2)$$

where the variable $z_3$ corresponds to the time variable $t$. Note that in this report, $t$ will be
an integer variable corresponding to frame numbers. As in the 2-D case, the coefficient $a_t$
determines the filter characteristics and has to be chosen properly.

Denoting the 3-D input of the filter as $x(n_1,n_2,t)$ and the 3-D output as $Y(n_1,n_2,t)$, the
input-output relationship of the total filter structure is given by:

$$Y(n_1,n_2,t) = \mathrm{Median}\big(x(n_1,n_2,t),\; y_t(n_1,n_2,t),\; Y_{ap}(n_1,n_2,t)\big)$$

$$Y_{ap}(n_1,n_2,t) = \mathrm{Median}\big[\mathrm{Median}\big(y^{++}(n_1,n_2,t),\, x(n_1,n_2,t),\, y^{--}(n_1,n_2,t)\big),\; x(n_1,n_2,t),\; \mathrm{Median}\big(y^{+-}(n_1,n_2,t),\, x(n_1,n_2,t),\, y^{-+}(n_1,n_2,t)\big)\big]$$

$$\begin{aligned}
y^{++}(n_1,n_2,t) &= Z^{-1}\big(H(z_1, z_2)\,X_t(z_1,z_2)\big) \\
y^{--}(n_1,n_2,t) &= Z^{-1}\big(H(z_1^{-1}, z_2^{-1})\,X_t(z_1,z_2)\big) \\
y^{-+}(n_1,n_2,t) &= Z^{-1}\big(H(z_1^{-1}, z_2)\,X_t(z_1,z_2)\big) \\
y^{+-}(n_1,n_2,t) &= Z^{-1}\big(H(z_1, z_2^{-1})\,X_t(z_1,z_2)\big) \\
y_t(n_1,n_2,t) &= Z^{-1}\big(H(z_3)\,X(z_1,z_2,z_3)\big)
\end{aligned}$$

with $Z^{-1}$ and $X_t(z_1,z_2)$ denoting the inverse Z-transform and the transform of a single
input frame at time $t$, respectively. The notation $X(z_1, z_2, z_3)$ describes the Z-transform of
the 3-D input signal.
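
As a hedged illustration of the structure above, the following Python sketch implements the four first-order quadrant recursions, the two-stage median combination, and a temporal recursion with a one-frame delay. The function names (`quadrant_filter`, `median3`, `spatial_block`, `hybrid_median_3d`), the edge-replicating boundary treatment, and the initialization of the temporal state with the first frame are assumptions made for this example; the report does not specify boundary or initial conditions.

```python
import numpy as np

def quadrant_filter(x, a):
    """Causal (first-quadrant) first-order recursion, one of the four linear modules."""
    rows, cols = x.shape
    # Assumed boundary handling: replicate the first row/column of the input
    # and use those values as initial conditions for the recursion.
    xp = np.pad(x, ((1, 0), (1, 0)), mode="edge")
    y = xp.copy()
    b = 0.5 - a
    for i in range(1, rows + 1):
        for j in range(1, cols + 1):
            y[i, j] = (a * (y[i - 1, j] + y[i, j - 1])
                       + b * (xp[i - 1, j] + xp[i, j - 1]))
    return y[1:, 1:]

def median3(p, q, r):
    """Pointwise three-input median."""
    return np.median(np.stack([p, q, r]), axis=0)

def spatial_block(x, a):
    """Nonlinear 2-D block: causality inversion is realized by flipping axes."""
    ypp = quadrant_filter(x, a)
    ymm = quadrant_filter(x[::-1, ::-1], a)[::-1, ::-1]
    ymp = quadrant_filter(x[::-1, :], a)[::-1, :]
    ypm = quadrant_filter(x[:, ::-1], a)[:, ::-1]
    return median3(median3(ypp, x, ymm), x, median3(ypm, x, ymp))

def hybrid_median_3d(frames, a=0.4, a_t=0.6):
    """Apply the 3-D hybrid median structure to an array of shape (T, rows, cols)."""
    frames = np.asarray(frames, dtype=float)
    out = np.empty_like(frames)
    y_t = frames[0].copy()  # assumed initial condition of the time filter
    for t in range(len(frames)):
        if t > 0:
            # First-order temporal recursion driven by the previous input frame.
            y_t = a_t * y_t + (1.0 - a_t) * frames[t - 1]
        out[t] = median3(frames[t], y_t, spatial_block(frames[t], a))
    return out
```

Under these assumptions, an isolated single-pixel impulse in one frame is removed entirely and a constant sequence passes through unchanged. The nested Python loops are written for clarity, not speed.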

In the next section, we will explain in detail how this structure meets the required con- 
ditions and what other special properties it exhibits. 


2.2 Filter Properties 
2.2.1 General Properties 

Next, we state general properties of the filter by keeping the notation to a minimum. Proofs 
are not included for the sake of brevity. 

Property 1 (Pseudo-Linearity): 

Let $\mathcal{R}\{\cdot\}$ denote the filter operation and $x(n_1,n_2,t)$ any input signal; then the filter shows
the following pseudo-linearity property:

$$\mathcal{R}\{a\,x(n_1,n_2,t) + b\} = a\,\mathcal{R}\{x(n_1,n_2,t)\} + b \qquad (3)$$

For the above property to hold, it is necessary that $a$ and $b$ are constants.
Property 2 (Response Bounds):

The filter output $Y(n_1,n_2,t)$ is bounded by the five linear filter outputs

$y^{++}(n_1,n_2,t)$, $y^{--}(n_1,n_2,t)$, $y^{+-}(n_1,n_2,t)$, $y^{-+}(n_1,n_2,t)$, $y_t(n_1,n_2,t)$.

The above property is useful for examining the filter response to noise. Other bounds 
can be formulated, which involve the output signals of the first stage of median filters. 
Qualitative properties concerning noise will be examined in more detail later. 

Property 3 (Symmetry): 

If the input signal $x(n_1,n_2,t)$ shows symmetry with respect to the $n_1$- or $n_2$-axis, or if it
shows point symmetry with respect to the origin, this symmetry is preserved in the output
signal $Y(n_1,n_2,t)$.

The above property corresponds to zero-phase behavior of linear systems. 

Since a nonlinear filter does not allow an analytical characterization of its spectral properties,
much of the analysis work concentrates on the spatial domain. A video sequence can
be interpreted as 3-D data, and therefore the dimensionality of image features is an important
aspect in the analysis of the filter performance. The following statements assume that
binary image sequences are considered, i.e. the input to the filter can take only two values.

Property 4 (Signal Dimensionality): 

(a) A zero-dimensional (point) input signal (an impulse in three dimensions) is removed. 

(b) A 1-D signal (line) is completely preserved if it is zero-dimensional in the $(n_1,n_2)$
plane. It is partially preserved if it is zero-dimensional in time.

(c) A 2-D signal (plane) is completely preserved if its orientation is parallel to two of the
axes $n_1$, $n_2$, $t$.

(d) A 3-D signal is completely preserved. 

In the above property, 1-D, 2-D, and 3-D signals are assumed to be of infinite extent.
The property shows that the degree of preservation increases with signal dimensionality.
In other words, signals which have a high correlation in two or more directions are given
a higher preference than signals having a correlation only in one direction. Isolated (zero-dimensional)
impulses, which do not have any neighbors in space and time, are completely
removed, providing the filter structure with an impulse noise suppression capability. The
above property is essentially a multi-dimensional lowpass property, since sub-dimensional 
signals always produce high frequency components in certain directions of the spectrum. 

This leads us to the next subsection, which addresses the more general case of scene 
changes and motion and the corresponding filter response. 
2.2.2 Scene Changes and Motion

First, we consider the case of a stationary sequence in time, i.e.

$$x(n_1,n_2,t) = x(n_1,n_2), \quad \text{for } t \geq 0$$

where $x(n_1,n_2,t)$ is arbitrary for $t < 0$. Then for $t \to \infty$, it can be shown that the filter
response converges to the stationary input signal:

$$\mathcal{R}(x(n_1,n_2,t)) \to x(n_1,n_2,t), \quad t \to \infty \qquad (4)$$

The convergence error decays in proportion to $a_t^t$; in particular:

$$|\mathcal{R}(x(n_1,n_2,t)) - x(n_1,n_2,t)| \leq k(n_1,n_2)\,a_t^t, \quad t > 0 \qquad (5)$$

with $k(n_1,n_2)$ being a function of the initial conditions at $(n_1,n_2)$ and the neighborhood
of $(n_1,n_2)$ in the stationary image. A more detailed analysis shows that parts of the image
containing a large high frequency content will converge slower than image portions with little
high spatial frequency content. The parameter $a_t$ controls the duration of the transient phase
for a change between two scenes, and the parameter $a$ determines the degree of bandlimitation
of a single frame right after the scene change.
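
The geometric decay in (5) can be illustrated for a single pixel of the linear temporal module alone. The sketch below is a minimal example under stated assumptions: it models only the first-order time recursion (not the full nonlinear structure), it assumes $0 < a_t < 1$, and the function names `temporal_transient` and `frames_to_settle` are hypothetical.

```python
def temporal_transient(x_new, y_init, a_t, n_frames):
    """Error |y_t - x_new| of the first-order temporal recursion after a scene change.

    x_new  : stationary pixel value after the scene change
    y_init : filter state left over from the previous scene
    """
    y = y_init
    errors = []
    for _ in range(n_frames):
        errors.append(abs(y - x_new))
        y = a_t * y + (1.0 - a_t) * x_new  # one recursion step per frame
    return errors

def frames_to_settle(x_new, y_init, a_t, tol=1.0):
    """Number of frames until the error drops to tol or below (assumes 0 < a_t < 1)."""
    y, t = y_init, 0
    while abs(y - x_new) > tol:
        y = a_t * y + (1.0 - a_t) * x_new
        t += 1
    return t
```

Each step multiplies the error by $a_t$, so the error after $t$ frames is $a_t^t$ times the initial error. For example, with $a_t = 0.6$ a full-scale 8-bit step of 255 gray levels falls below one gray level after 11 frames, whereas $a_t = 0.9$ needs 53 frames.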

This behavior matches to some degree the limitation of the human visual system, which 
after a scene change is relatively insensitive to image details. This sensitivity increases with 
the time elapsed after the scene change. A few hundred ms after the scene change, the 
full sensitivity for the stationary case is reached again[5]. This characteristic of the human 
visual system can be exploited in the video sequence compression problem. In interframe
compression schemes, complete scene changes usually cause difficulties, since these methods
rely on the correlation between consecutive frames. Using this filter, it seems possible to 
momentarily reduce the spatial bandwidth, maintaining the desired compression rate without 
causing perceivable image quality degradation. 

This property of the filter is illustrated in Figures 4-8. Fig. 4 corresponds to the frame
before the scene change, or equivalently to the initial conditions before $t = 0$. Fig. 5 is the
unfiltered new stationary input frame after the scene change. The transient filter response
to this scene change is shown in Figs. 6-8. In particular, Fig. 6 shows the first filter output
frame after the scene change, Fig. 7 shows the second, and Fig. 8 shows the 10th response
frame after the scene change. The parameters used in this simulation were $a_t = 0.6$, $a = 0.4$,
which guarantees a sufficiently fast convergence rate.

Next, we will consider the case of a sequence of highly uncorrelated frames in time. 
Obviously the output of the temporal 1-D filter cannot provide reliable information. Only 
the 2-D spatial filter together with the input signal provide useful information. Therefore, 
a significantly larger error can occur than for the stationary case, since the response at any 
time can be considered a transient response. 

An upper bound for the ‘error’ between filter input and filter output can be expressed
as:

$$|Y(n_1,n_2,t) - x(n_1,n_2,t)| \leq |Y_{ap}(n_1,n_2,t) - x(n_1,n_2,t)| \qquad (6)$$
One could argue that since previous frames in time do not provide any information about
the currently processed image, the spatial 2-D filtering module might be used instead of the 
3-D filter. But it can be shown that for the noiseless case, the response of the 3-D filter is 
preferable over the response of the 2-D filter in the mean square sense. 

The case of a partially correlated image sequence, consisting of a static background and an 
active dynamic region, can be considered to be a combination of the two previously considered 
cases. As a consequence, the stationary background maintains its original resolution and 
noise is effectively suppressed. In dynamic regions, resolution is reduced by a degree, which 
depends on the lowpass characteristics of the spatial filter, i.e. the parameter a. Again this 
behavior matches the sensitivity of the human visual system to a significant extent, since 
detail perception in areas of motion is more limited than for static areas[6]. 

2.2.3 Noise and Spectral Properties 

A complete statistical analysis of the 3-D filter is extremely difficult, since most of the inputs 
to the three point median operators are statistically dependent. It is however possible to 
investigate the statistical properties of the linear filter modules and the properties of the 
output signals, produced by the first stage of median operations. Although such a partial 
analysis provides important information for the analysis of the filter performance under 
noise, it has to be supported by extensive simulations. Next, we will provide a qualitative 
description of this performance: 

- The noise suppression of Gaussian and impulsive noise is very effective in regions of
approximately constant signal level in space and time. Stationary regions in time with 
a high spatial frequency content or areas with a large spatial low frequency component 
but changing rapidly in time still show significant improvements. 

- The 3-D filtering algorithm is usually less effective in areas of motion or areas of high 
temporal activity. This behavior can be exploited in the change detection algorithm, 
which will be introduced in section 3. 

- The noise intensity is reduced only insignificantly near moving spatial edges. This 
drawback can be corrected by using a slightly modified filter structure, which introduces 
an additional nonlinear time filter. 

The above properties are illustrated in Figures 9-12. In particular, Fig. 9 shows an image
that is part of a stationary sequence corrupted by Gaussian noise. Fig. 10 illustrates the
corresponding output image produced by the 3-D filter. A significantly improved noise
suppression for the stationary case can be obtained by increasing the temporal coefficient $a_t$.
Fig. 11 is a frame from the “Walter” sequence, which is corrupted by Gaussian noise. The
corresponding output image is shown in Fig. 12, which indicates that noise suppression in
the temporally active head region and along edges is not as good as in the static background.
This property will be exploited later in the change detection scheme.
The spectral properties of individual 2-D filter modules are illustrated in Figs. 13 to 15.
Fig. 13 shows the magnitude gain of a linear 2-D filter module, with a = 0.4. Fig. 14 shows 
the spectrum of a white noise input image and Fig. 15 shows the spectrum of the correspond- 
ing output image, produced by the nonlinear 2-D spatial filter. An interesting observation 
can be made, by comparing the spectral characteristics of the 2-D linear module and the 
2-D nonlinear spatial filter. Although the linear module does not overemphasize horizontal 
and vertical frequencies, the spectrum of the nonlinear filter shows a significantly higher 
response to horizontal and vertical frequencies than it does to frequencies corresponding to 
a diagonal direction. This effect is created by the specially chosen nonlinear filter structure 
and is similar to the passband of the human visual system. 

2.2.4 Stability and Finite Wordlength Considerations 

In section 2.1 the question of the allowable parameter ranges for $a$ and $a_t$ and the resulting
filter properties was left open. The most important aspects of this question will be addressed
in this section. First, the temporal filter will be considered.

It is easy to show that stability of the ideal (infinite wordlength) realization of the digital
filter yields the condition

$$-1 < a_t < 1.$$

If, in addition, a lowpass filter with an aperiodic response is required, then the allowable
range reduces to

$$0 < a_t < 1.$$

At the same time, this range guarantees the following properties: 


- The filter cannot produce an overflow situation, since the output will always stay within 
the same range as the input signal. This is particularly useful for image processing 
applications, since the occurrence of negative output values is also avoided. Since the 
occurrence of under- or overflow requires additional action, avoiding these situations 
is especially advisable for real time implementations. 

- If the filter is implemented in fixed point format, limit cycles do not exist for the mag- 
nitude truncation format. This is independent of whether the intermediate results are 
computed with full precision or whether quantization is performed immediately after 
multiplication. Due to the required computational speed, a fixed point implementation 
of the algorithm is preferable over the floating point option, although very powerful 
and fast floating point format hardware is already available. 

- For $a_t$ close to zero, the passband is very large. It tends to zero as $a_t$ tends to 1.

- The filter always has a 0-dB DC-gain.

- In the range $0.8 < a_t < 1$, noticeable artifacts can occur in temporally active areas.
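
The passband and DC-gain statements in the list above can be checked numerically. The following sketch evaluates the magnitude response of the temporal filter on the unit circle, assuming the first-order form $H(z_3) = (1 - a_t)/(z_3 - a_t)$; the function name `temporal_gain` is introduced here for illustration only.

```python
import numpy as np

def temporal_gain(a_t, omega):
    """Magnitude of H(z3) = (1 - a_t) / (z3 - a_t) at z3 = exp(j*omega).

    omega is the temporal frequency in radians per frame (scalar or array).
    """
    z = np.exp(1j * np.asarray(omega, dtype=float))
    return np.abs((1.0 - a_t) / (z - a_t))
```

At $\omega = 0$ the gain is 1 (0 dB) for any $a_t$ in the allowed range, and at $\omega = \pi$ it equals $(1 - a_t)/(1 + a_t)$, which is 0.25 for $a_t = 0.6$. Comparing `temporal_gain(0.3, w)` with `temporal_gain(0.9, w)` over a grid of frequencies shows the passband narrowing as $a_t$ approaches 1.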

Let us now consider the 2-D spatial filter: 

Stability requires the following parameter range: 

$$-0.5 < a < 0.5$$

Again, if in addition we impose the aperiodic response and lowpass condition, the range is 
reduced to: 

0 < a < 0.5 

Similarly to the 1-D case, the following properties are ensured through the above constraint: 

- The filter cannot produce an overflow or negative output values. 

- If the filter is implemented using a magnitude truncation format, no limit cycles can 
occur. 

- The bandwidth increases with decreasing values of a. 

- The filter always has a 0-dB gain at DC. 

- Values of a which guarantee a good performance depend highly on the high frequency 
content of the image sequence, and typically range from 0.25 to 0.4. 
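
The overflow and DC-gain properties in the list above can be seen from the difference equation implied by (1). The short derivation below is a sketch; the symbol $y$ for the output of a single causal linear module, and the assumption that the initial conditions already lie within the input range, are introduced here for illustration.

$$y(n_1,n_2) = a\,y(n_1-1,n_2) + a\,y(n_1,n_2-1) + (0.5-a)\,x(n_1-1,n_2) + (0.5-a)\,x(n_1,n_2-1)$$

For $0 < a < 0.5$ all four coefficients are positive and

$$a + a + (0.5-a) + (0.5-a) = 1,$$

so each output sample is a convex combination of earlier outputs and inputs. By induction over the recursion, $\min x \leq y(n_1,n_2) \leq \max x$, which rules out overflow and, for nonnegative input, negative output values. Evaluating (1) at $z_1 = z_2 = 1$ gives

$$H(1,1) = \frac{2(0.5-a)}{1-2a} = 1,$$

i.e. a gain of 0 dB at DC.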

Although this method is not restricted to first order linear modules, some of the advantages 
of this design are lost if higher order designs are used. This is due to the fact that generally, 
the aperiodic and limit-cycle free behavior is harder to ensure. Also, several parameters 
would have to be changed in order to modify the passband region. 

It should be mentioned at this point that if the temporal and the spatial filters are chosen
according to the range conditions mentioned above, all the resulting properties of the linear
filtering modules also apply to the overall structure, i.e. stability, exclusion of limit cycles,
0-dB DC-gain, and the preservation of the dynamic range of the input signal.

2.3 Preprocessing for Dynamics Estimation 

The purpose of using a prefilter is usually signal conditioning for the algorithm, which 
follows the prefilter. Although the motivation for using the previously discussed 3-D filter 
lies in limiting the spectrum without visually degrading the image and removing noise (and 
therefore improving the attainable compression rate), it also preconditions the signal for the 
change detection algorithm. Although this algorithm is explained in detail in section 3, the 
main idea will be stated next, since it is mandatory for the understanding of how the 3-D 
filter conditions the signal for the change detection scheme. 
Since we assume every image in the video sequence to consist of a stationary background
and some active regions, one only needs to transmit the changing portions of each image.
This requires the identification of temporally active image regions. For each of the spatially
separate blocks, some kind of activity measure is computed, and one can transmit only a
certain portion of all blocks, i.e. the most active ones. Noise in the sequence will certainly
increase the probability of error in choosing these blocks. This occurs especially in regions of 
motion with little texture and no significant edges. The chance of an inactive block yielding 
a higher activity measure than an active block is reduced by the use of the 3-D prefilter, 
since the filter removes noise very effectively in stationary areas but performs significantly 
worse in temporally active regions (See Fig. 12). As a result, the stationary region will yield 
a decreased activity measure, since the additional contribution of the noise is removed by 
prefiltering. 

Since the activity measure is computed blockwise, we wish to prevent the case in which a 
significant edge slowly moves out of a particular block, contributing only slightly to the activ- 
ity measure (See Fig. 16). Since an edge contains important syntactical image information, 
it is desirable that such a block is updated. The 3-D filter artificially increases the activity 
measure near edges, since its noise suppression capability is reduced in the neighborhood of 
2-D step functions, and increases the chance that this block will be properly classified. 

All the above remarks apply to the 3-D filter structure in its present form. Other potentially
useful properties for the change detection scheme arise if the filter is modified slightly.
Consider for example the case of combining the four outputs of the linear 2-D spatial filters 
to create a zero phase linear filter by simply adding the outputs at each point. Since the 
resulting filter is linear and shift-invariant, it will blur edges and therefore “spread” out highly 
localized activity, creating a more robust activity measure. This effect is demonstrated in 
Fig. 16. The blurred output signal is obviously used only for the detection scheme, rather 
than as an input signal to the coding algorithm. 
3 Detection of Temporal Dynamics for Scene Segmentation


In many typical applications of video systems, the temporally evolving scene can be usefully 
modelled as a static background with dynamic entities superimposed. While a practical 
complete system must generally deal with frames which have a large percentage of active 
area, we are presently concentrating only on scenes of limited dynamic area. Surveillance or 
remote monitoring settings may consist strictly of such configurations, and large segments of 
other typical image sequences have the same characteristic. We plan to use this spatial non- 
stationarity in temporal activity in sequence compression, and thus need a computationally 
simple and fast method for segmenting each frame into dynamic and non-dynamic parts. 
Adaptivity, which will make the algorithm more widely applicable, is to be introduced as a 
component of the coding system. 

The detection/estimation of dynamics is served well by the nonlinear filtering of Sec. 2 as 
a preprocessor. The reduction of uncorrelated noise allows reduced thresholds for detection at 
the pixel level, for greater sensitivity without significantly larger error probability in regions 
of the 3-D data set which are relatively constant. The filtering system serves, in a sense, as
an activity detector in itself, since spatial filtering is suppressed in temporally active areas.
Though we do not yet directly exploit this behavior, the greater variation remaining in the 
active areas triggers greater dynamics detection in the algorithm to follow. 

3.1 Outline of Algorithm 

An important premise of our work is that the image will be subdivided into non-overlapping 
square blocks, with their borders fixed. Without doubt, a better segmentation would be 
possible with arbitrary configurations and block sizes, but we maintain block borders for 
the sake of algorithm and hardware simplicity. The benefits of simplicity are for the sake 
of not only the detection/estimation portion of the system, but also subsequent coding. We 
are working under the constraint of real-time realizability. This requires that our system be 
implementable with high-speed circuitry for such tasks as transform coding. Block coding 
schemes normally operate on 8 x 8 or 16 x 16 blocks; thus we will restrict ourselves to blocks 
of at least 8x8. For the sake of accurate segmentation and reduction of artifacts, the smallest 
size possible is preferable. 

We assume that those blocks estimated to be inactive are not transmitted. As in predic- 
tive coding, the coder must store the image which will be reconstructed at the decoder as 
the “past” image. Thus we maintain an image at the coder which is updated only in active 
blocks. This will be the assumed form of the reference frame $t - 1$. The information we
use as observations is the frame difference (FD), a simple pixel-by-pixel subtraction of the
reference (past) image from the current true image, and will be expressed as

$$d(n_1,n_2,t) = x(n_1,n_2,t) - x(n_1,n_2,t-1). \qquad (7)$$
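
The bookkeeping described above, a frame difference against a reference image that is refreshed only in active blocks, can be sketched as follows. This is an illustrative example, not the report's implementation: the function names `frame_difference` and `replenish`, the boolean block map `active_blocks`, and the default block size of 8 are assumptions.

```python
import numpy as np

def frame_difference(current, reference):
    """Pixel-by-pixel difference of Eq. (7): current frame minus stored reference."""
    return current.astype(float) - reference.astype(float)

def replenish(reference, current, active_blocks, block=8):
    """Return a new reference image, updated only in blocks flagged as active.

    active_blocks is a boolean array with one entry per block
    (shape: rows // block by cols // block).
    """
    updated = reference.copy()
    for bi, bj in zip(*np.nonzero(active_blocks)):
        r, c = bi * block, bj * block
        updated[r:r + block, c:c + block] = current[r:r + block, c:c + block]
    return updated
```

Because inactive blocks are never copied, the stored reference matches what a decoder would hold after receiving only the active blocks, which is why the difference is taken against it and not against the true previous frame.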

The vector of parameters to be estimated consists of the binary classifications of individual blocks.
If we use H as this vector, and d as the frame difference values, the Bayesian estimate is
formulated as

$$\mathbf{H}_{MAP} = \arg\max_{\mathbf{H}}\; p(\mathbf{H} \mid \mathbf{d}) \qquad (8)$$

For computational feasibility, we use sub-optimal estimation, since the dimensionality of the 
problem makes true globally optimal estimation impractical. The form of the algorithm is a 
three-level hierarchy with simple, local computations at each level, requiring only one pass 
through the tree structure. 

3.2 Formulation of the Hierarchical Estimation Problem 

As a probabilistic model for the binary field consisting of the block classifications, we employ 
the Markov random field (MRF)[7]. The model for the field of block classifications H is 
defined by the Gibbs’ distribution: 

$$p(\mathbf{H}) = Z(\alpha_1)\exp\Big(-\alpha_1 \sum_i V_i(\mathbf{H})\Big) \qquad (9)$$

Each entry of the exponent is a function of local differences between $H_i$ and its neighboring
blocks. $Z(\alpha_1)$ is a normalizing constant called the partition function. In general, the true
MAP estimate of (8) is a very computationally difficult task, since both the observation 
vector and H have many elements. The MRF model has been found useful in a wide variety 
of image segmentation problems[8, 9], and has a great computational advantage: the choice of 
a given block’s class, given the remainder of the blocks, is dependent only on a small number 
of neighbors in space. Typically, greedy optimization algorithms for MAP segmentation 
choose the value for the given point, given the current state of the field, which maximizes 
the a posteriori likelihood. We take a similar approach at the highest level in our hierarchy, 
considering individual blocks in turn. 
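
A greedy update of this kind can be sketched as a single raster-scan pass over the block field. The example below is only an illustration under explicit assumptions: the prior is taken to be an Ising-type potential that counts disagreements with the four nearest neighbors (the report does not give the potential in this section), `llr` is a hypothetical array of per-block data log-likelihood ratios, and `icm_pass` and `alpha` are names chosen for this example.

```python
import numpy as np

def icm_pass(H, llr, alpha=1.0):
    """One greedy raster-scan pass over a binary block field.

    H     : 2-D integer array of current block classes (0 static, 1 dynamic)
    llr   : 2-D array, log p(d | H_i = 1) - log p(d | H_i = 0) for each block
    alpha : weight of the assumed neighbor-disagreement prior
    """
    H = H.copy()
    rows, cols = H.shape
    for i in range(rows):
        for j in range(cols):
            nbrs = [H[r, c]
                    for r, c in ((i - 1, j), (i + 1, j), (i, j - 1), (i, j + 1))
                    if 0 <= r < rows and 0 <= c < cols]
            ones = int(sum(nbrs))
            zeros = len(nbrs) - ones
            # Posterior log-odds of class 1: data term plus prior term.
            score = llr[i, j] + alpha * (ones - zeros)
            H[i, j] = 1 if score > 0 else 0
    return H
```

With `llr` equal to zero everywhere, an isolated static block surrounded by dynamic neighbors is flipped to dynamic, while a strongly negative data term at that block keeps it static. The result of such a pass depends on the starting field, which is the motivation for the initialization discussed next.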

The performance of such greedy minimization procedures, however, can be profoundly 
influenced by the initial state. To achieve a good starting point of initial block classifications, 
we descend to the second level of the hierarchy. Considering the class of only the i-th block 
conditioned on the remainder of the field, we can simplify the expression of (8) by taking 
advantage of the fact that we now have a binary-valued prior, and can formulate the problem
as a log likelihood ratio test. If we define the simple hypotheses as

$H_i = 0$ : Block $i$ is static

$H_i = 1$ : Block $i$ is temporally dynamic,

then, again using d as observations, the likelihood ratio becomes

$$\frac{P(H_i = 1 \mid \mathbf{d})}{P(H_i = 0 \mid \mathbf{d})} = \frac{p(\mathbf{d} \mid H_i = 1)\,P(H_i = 1)}{p(\mathbf{d} \mid H_i = 0)\,P(H_i = 0)} \qquad (10)$$

All probabilities are understood to be conditioned on the remainder of the field. 
Figure 1: Hierarchy of detection/estimation for segmentation of active/inactive blocks. In
this example, blocks are 4x4 pixels. (The figure carries the label $\arg\max_{\mathbf{H}}\, p(\mathbf{h} \mid \mathbf{H})\,p(\mathbf{H})$.)


But $P(H_i = 0)$ and $P(H_i = 1)$ are functions of the neighboring blocks in our model, and
may be dropped for the (single block) hypothesis test on the second level. (We implicitly
assume the unconditional $P(H_i = 1)$ and $P(H_i = 0)$ are both equal to 0.5.) It is generally
simpler to deal with the log likelihood ratio, which for (10) becomes

$$\log \frac{p(\mathbf{d} \mid H_i = 1)}{p(\mathbf{d} \mid H_i = 0)} \qquad (11)$$

The vector d has a dimension equal to the number of pixels in a single block, typically 64 or
256 in our investigations. We model the field of pixel values in each frame, and hence the FD
signal, as another MRF in order to account for spatial correlation among the dynamic pixels
composing larger entities in a typical image. Even these block-wise decisions are therefore
very complex when the observations include all the information in d.

A choice similar to that we faced in the top level decision process now exists for the
single block classifications. The optimal initial state can be called centralized detection for
this binary choice, and is the obvious technique of choosing the hypotheses according to
(10). However, given the Markov model for the difference under hypothesis $H_i = 1$, and a
zero-mean independent Gaussian under $H_i = 0$, we again face a complex computational and
modelling problem in evaluating (10). To make the problem tractable, we
condense the information in d into binary threshold tests, which form the third level of the
hierarchy. The resulting vector, which we will denote h, consists of binary entries representing
the results of pixel-wise hypothesis tests, representing the lowest level in Figure 1:

$h_i = 0$ : Pixel is static

$h_i = 1$ : Pixel is temporally dynamic.

This distributed detection is suboptimal on the block level, but the loss in performance for 
a spatially constant signal, with the given number of samples, is very small[2]. In our case, 
in fact, where noiseless difference signals for dynamic areas may vary widely, the removal 
of magnitude information in the reduction of pixel-level information to binary values helps 
preserve areas of large-scale dynamics at low intensities. These areas may be perceptually 
more important than the magnitude of d in these regions would indicate. 


3.3 Execution of Algorithm 


The segmentation algorithm is computed in the opposite order to that followed above. Here
we may follow the structure of Fig. 1, this time from bottom to top. At the pixel level, we
first form the vector h in the manner of the likelihood ratio of Donahoe et al. [10], via a
likelihood ratio test

$$\log\left[\frac{p(h_i = 1 \mid d_i)}{p(h_i = 0 \mid d_i)}\right] = \log\left[\frac{p(d_i \mid h_i = 1)\,P(h_i = 1)}{p(d_i \mid h_i = 0)\,P(h_i = 0)}\right] \underset{h_i = 0}{\overset{h_i = 1}{\gtrless}} 0 \qquad (12)$$

or equivalently

$$\log\left[\frac{p(d_i \mid h_i = 1)}{p(d_i \mid h_i = 0)}\right] \underset{h_i = 0}{\overset{h_i = 1}{\gtrless}} \log\left[\frac{P(h_i = 0)}{P(h_i = 1)}\right] \qquad (13)$$


The prior probabilities for pixels, on the right-hand side of (13), can be approximated from 
the sequence, and are equal to the fractions of pixels in a typical frame which are active, and 
inactive, respectively. We estimate noise variance directly from the image data, using edge 
detection techniques to avoid contamination of the estimate by edge structure^]. 

If a pixel is inactive, our model of FD is spatially independent Gaussian noise. When 
a pixel is in a dynamic region, we make the simplifying assumption that its values in con- 
secutive frames are independent. Thus the pixel’s new value is assumed to be drawn from 
the distribution represented by the image’s histogram, and di has the distribution of the his- 
togram convolved with its time reverse. This is due to the fact that for independent random 
variables X and Y, 


$$f_{X-Y}(x) = f_X(x) * f_Y(-x)$$

for probability density functions f_X and f_Y. The threshold will be located at the crossing 
of the levels of the Gaussian density under the null hypothesis (h_i = 0), and the histogram- 
based density under hypothesis h_i = 1, scaled by the right-hand side of (13). In Figure 2, 
we see that due to the large difference in variance under the two hypotheses, the threshold 
lies in an area of very high gradient in the Gaussian, and is therefore relatively insensitive 
to scaling of the log likelihood ratio by A. 

Figure 2: Probability densities of pixel difference value d_i under two hypotheses. The lower 
two curves are the histogram-based density (arrow), and density of the difference of two 
uniformly-distributed pixel values. Only the center section is shown; full region of support 
is (-255,255). 
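
The pixel-level test of (13) can be sketched as follows. This is only an illustration under stated assumptions, not the implementation used for the reported results: the function names are hypothetical, `sigma` is taken to be the standard deviation of the noise in the frame difference, `p_active` stands for the prior P(h_i = 1), and the Gaussian density is simply evaluated at integer differences.

```python
import math

def difference_density(hist):
    """Distribution of d = X - Y for independent X, Y drawn from `hist`.

    `hist` holds one nonnegative count per gray level. The result has
    2*len(hist) - 1 entries, indexed by d + len(hist) - 1; it is the
    normalized histogram convolved with its own reversal.
    """
    total = float(sum(hist))
    p = [count / total for count in hist]
    n = len(p)
    out = [0.0] * (2 * n - 1)
    for x, px in enumerate(p):
        if px == 0.0:
            continue
        for y, py in enumerate(p):
            out[x - y + n - 1] += px * py
    return out

def pixel_threshold(hist, sigma, p_active):
    """Smallest |d| at which the test of (13) decides h_i = 1.

    Assumes zero-mean Gaussian noise with standard deviation `sigma`
    under h_i = 0 and prior probability `p_active` for h_i = 1.
    """
    n = len(hist)
    dens1 = difference_density(hist)
    log_prior_ratio = math.log((1.0 - p_active) / p_active)
    for d in range(n):
        p1 = dens1[d + n - 1]
        if p1 <= 0.0:
            continue
        # Log of the Gaussian density, computed directly to avoid underflow.
        log_p0 = -0.5 * (d / sigma) ** 2 - math.log(sigma * math.sqrt(2.0 * math.pi))
        if math.log(p1) - log_p0 > log_prior_ratio:
            return d
    return n  # no crossing found within the support
```

For a flat 256-level histogram with `sigma = 2.0`, this sketch places the threshold at |d| = 8 for `p_active = 0.1` and at |d| = 7 for `p_active = 0.2`, which is in line with the observation that the threshold moves little when the prior scaling changes.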

At the level of single block classifications, the distribution of the supra-threshold pixels 
is independent of spatial relationships under H_i = 0, since they are modelled as the result 
of independent noise. Their distribution is the binomial, 

$$P(K = k \mid H_i = 0) = \binom{N}{k}\, p_0^k (1 - p_0)^{N-k} \qquad (14)$$

where k is the number of 1's in h for the block, and N is the number of pixels in the block. 
We use p_0 as the probability of any d_i exceeding the threshold in magnitude under H_i = 0, 
and p_1 for the same under H_i = 1. Under 
the hypothesis H_i = 1, a similar binomial could be used, but given the physical meaning of 
this hypothesis, the spatial relationships are relevant. To incorporate this feature, we return 
to the MRF, as expressed in (9). The same form applies, with each H_i replaced with h_i, and 
a different scaling factor α_2. As in the block field case, binary entries in h mean that the 
cost function V_i(h) is only a scaled count of pixels which differ from neighbors. 

The standard form of the MRF is symmetric in the probabilities of 1's and 0's. A 
serious handicap of MRF's is the intractability of computing Z(α)[S]. Therefore, choosing 
a multiplicative factor to fix p_1 at some value other than 0.5, while retaining the Gibbs 
distribution, is not practical using any known means. Using a symmetric MRF, with p_1 = 0.5, 
is also impractical when we know a priori that even under H_i = 1, the expected number of 
active pixels is well under 50%. Our solution to this quandary is a combination of the MRF 
and the binomial distributions: 


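$$L(h) = \frac{p(h \mid H_i = 1)}{p(h \mid H_i = 0)} = \frac{Z^{-1}(\alpha_2)\,\exp\left\{-\alpha_2 \sum_{i \in \mathrm{block}} V_i(h)\right\}\binom{N}{k}\, p_1^k (1 - p_1)^{N-k}}{\binom{N}{k}\, p_0^k (1 - p_0)^{N-k}} \qquad (15)$$

This models both the desired p_1, and spatial connectivity of active pixels under H_i = 1. 
Cancelling common terms, and combining all those independent of h into a single constant 
Z(α_2), (15) becomes 

$$L(h) = Z^{-1}(\alpha_2) \left[\frac{p_1 (1 - p_0)}{p_0 (1 - p_1)}\right]^{k} \exp\left\{-\alpha_2 \sum_{i \in \mathrm{block}} V_i(h)\right\} \qquad (16)$$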


As mentioned earlier, the function V_i(h) is only a count of pixels in the spatial neighbor- 
hood of pixel i whose values differ from h_i. The single block log likelihood ratio is now quite 
simple: 

$$k \log\left[\frac{p_1 (1 - p_0)}{p_0 (1 - p_1)}\right] - \alpha_2 \sum_{i \in \mathrm{block}} \bigl(N_0(i)\,h_i + N_1(i)\,(1 - h_i)\bigr) \;\underset{H_i = 0}{\overset{H_i = 1}{\gtrless}}\; \log Z(\alpha_2) \qquad (17)$$


N_0(i) and N_1(i) are the number of zeros and the number of ones, respectively, in the neigh- 
borhood of pixel i. The first term is a constant times the number of active pixels, the second 
a total of the number of pixels differing from their neighbors, scaled by α_2, and the third 
is constant. Z(α_2) is still too difficult to compute, but note that it affects the single-block 
likelihood ratio only by setting a threshold for the entire ratio. Rather than establish a value 
computationally for Z(α_2), we elect to set it by looking at several cases near the boundary 
between desired choices for H_i = 1 and H_i = 0. Choosing the constant to yield the desired 
choices in these guideline cases restricts the value to a very small range. 
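
A minimal sketch of the block statistic on the left-hand side of (17) follows. The function names, the argument `log_z` (standing for the hand-tuned constant log Z(α_2)), and the choice of a 4-pixel neighborhood restricted to the block are assumptions made for illustration; the neighborhood actually used may differ.

```python
import math

NEIGHBORS = ((-1, 0), (1, 0), (0, -1), (0, 1))  # assumed 4-neighborhood

def block_statistic(h, p0, p1, alpha2):
    """Left-hand side of (17) for one block.

    `h` is a list of rows of 0/1 pixel decisions. The sum over the block
    counts, for every pixel, the in-block neighbors whose value differs
    from it, i.e. N_0(i) when h_i = 1 and N_1(i) when h_i = 0.
    """
    rows, cols = len(h), len(h[0])
    k = sum(sum(row) for row in h)  # number of active pixels
    weight = math.log(p1 * (1.0 - p0) / (p0 * (1.0 - p1)))
    disagreements = 0
    for r in range(rows):
        for c in range(cols):
            for dr, dc in NEIGHBORS:
                rr, cc = r + dr, c + dc
                if 0 <= rr < rows and 0 <= cc < cols and h[rr][cc] != h[r][c]:
                    disagreements += 1
    return k * weight - alpha2 * disagreements

def classify_block(h, p0, p1, alpha2, log_z):
    """Initial block decision: 1 (dynamic) if the statistic exceeds log_z."""
    return 1 if block_statistic(h, p0, p1, alpha2) > log_z else 0
```

With the same number of active pixels, a compact cluster produces fewer neighbor disagreements than scattered pixels and therefore a larger statistic, so a single threshold can accept the former and reject the latter.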

The log likelihood ratio in (17) provides a test for initial classification of blocks in the 
image, and a scalar which serves as a sufficient statistic for each block to be used by the MAP 
segmentation in the final stage. At the top level, we use an iterative approach similar to 
iterated conditional modes [11] to approximate the MAP solution, with greedy minimization 
for deterministic convergence. Our algorithm iterates among blocks, at each step changing 
the i-th block’s classification if it increases the value of the expression atop Fig. 1, conditioned 
on the remainder of the field: 


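$$\log\left[\frac{p(h \mid H_i = 1)\,P(H_i = 1 \mid H_j,\, j \neq i)}{p(h \mid H_i = 0)\,P(H_i = 0 \mid H_j,\, j \neq i)}\right] \;\underset{H_i = 0}{\overset{H_i = 1}{\gtrless}}\; 0 \qquad (18)$$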


For each block, the decision involves only a count of neighboring blocks of each class, and 
addition of the test statistic from the second level. Given a starting state for the field of blocks 
based on the block-wise LR test of (17), the estimate converges quickly to a segmentation 
with spatially clustered behavior as expected, plus good matches to data. 
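
The block-level iteration can be sketched as below. This is an illustrative reading of (18), not the code used for the experiments: it assumes a symmetric MRF prior on a 4-block neighborhood, so that the prior term reduces to a hypothetical weight `alpha1` times the difference between the counts of active and inactive neighbors, and it takes the data term to be the statistic of (17) minus `log_z`.

```python
def icm_segment(stat, log_z, alpha1, max_sweeps=20):
    """Greedy ICM-style refinement of the block field.

    `stat` is a 2-D list of block statistics from (17). Returns a 2-D
    list of 0/1 block labels.
    """
    rows, cols = len(stat), len(stat[0])
    llr = [[s - log_z for s in row] for row in stat]
    # Starting state: block-wise likelihood ratio test of (17).
    field = [[1 if v > 0 else 0 for v in row] for row in llr]
    for _ in range(max_sweeps):
        changed = False
        for r in range(rows):
            for c in range(cols):
                n1 = n0 = 0
                for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):
                    rr, cc = r + dr, c + dc
                    if 0 <= rr < rows and 0 <= cc < cols:
                        if field[rr][cc]:
                            n1 += 1
                        else:
                            n0 += 1
                # Data term plus neighbor-count prior term, as in (18).
                score = llr[r][c] + alpha1 * (n1 - n0)
                new_label = 1 if score > 0 else 0
                if new_label != field[r][c]:
                    field[r][c] = new_label
                    changed = True
        if not changed:
            break  # no block changed during a full sweep
    return field
```

In this sketch an isolated, weakly active block surrounded by inactive blocks is switched off, and a weakly inactive block inside an active region is switched on, which is the clustering behavior described above. Setting `alpha1 = 0` leaves the initial block-wise decisions unchanged.
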

The entire algorithm can be simply summarized: 

1. Compute frame difference. 

2. Perform hypothesis test of (13) at each pixel, creating binary image of active/inactive 
pixels. 

3. Compute log likelihood ratio of (17) to classify each block. 

4. Iterate toward MAP estimate of block field, with values of the log likelihood of step 3 
as the test statistic, and MRF as prior for blocks. 

In general, we may stop at this point, and transmit replenishment data for the blocks 
classified as 1's. But for the sake of our experimentation, whose purpose was to test the 
algorithm’s performance with marginal cases, we fix the number of blocks to be transmitted 
at one-half the total. To choose them from the segmentation, we select the 50% with the 
largest final values for the test statistic in (18). 
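
The fixed-rate selection used in the experiments amounts to ranking blocks by their final test statistic. A short sketch, with a hypothetical function name and with ties broken in raster order (an arbitrary choice made here):

```python
def select_top_blocks(final_stat, fraction=0.5):
    """Mark the given fraction of blocks with the largest statistic.

    `final_stat` is a 2-D list holding the final value of the test
    statistic in (18) for each block. Returns a 2-D 0/1 mask of the
    blocks chosen for transmission.
    """
    rows, cols = len(final_stat), len(final_stat[0])
    cells = [(r, c) for r in range(rows) for c in range(cols)]
    # Stable sort: equal statistics keep their raster order.
    cells.sort(key=lambda rc: final_stat[rc[0]][rc[1]], reverse=True)
    keep = int(fraction * len(cells))
    mask = [[0] * cols for _ in range(rows)]
    for r, c in cells[:keep]:
        mask[r][c] = 1
    return mask
```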

3.4 Experimental Results for Segmentation of Dynamics 

Several examples of the performance of the dynamics estimation algorithm appear in Figures 
17-20. The “Walter” sequence consists of a moving head and shoulders on a background 
which undergoes multiple-pixel displacement between some pairs of frames. Image quality 
is relatively poor, but the sequence served as a good testing set for initial development. The 
second sequence is of much higher quality, and includes significant panning and zooming, 
as well as segments of foreground/background. This type of sequence is outside the class 
for which the algorithm was designed, but illustrates well the usefulness of the probabilistic 
approach. 

These images show the advantage of considering connectivity in the segmentation of 
the dynamics of the image sequence. Because its effects cannot be evaluated well without 
viewing of frame-rate video, we include only a few frames illustrating the effects of the use 
of connectivity at both the lowest and highest levels of the hierarchy. In both cases, a 
significantly greater clustering of updated blocks is visible, which more closely matches both 
the structure of objects in the image, and patterns of perception. 

The advantage of our estimation algorithm lies in two areas. First, edges which may 
have only a few active pixels in a given block are more reliably classified as active, due to 
the weighting of pixel-level connectivity. If we do not frequently force a complete frame 
replenishment, loss of update in the region of high-intensity edges may cause objectionable 
artifacts, as is illustrated in Fig. 20 by the table border. Second, the MRF block-level model 
improves sequence quality primarily in regions of relatively low intensity changes, but signif- 
icant perceptual importance. This is illustrated in the face of Fig. 18, and the darker areas 
of Fig. 20. 


4 Computational Aspects 


In this section, the feasibility of the proposed approach is investigated by analyzing the 
computational requirements and the necessary system performance. 

The number of pixels in a single frame will be denoted by FS (frame size) and the number 
of operations will be expressed in terms of addition/multiplications or in short add/mults 
and compare/swap operations (com/sw). Compare/swap operations arise from the median 
operations and cannot be expressed in terms of multiplications and additions, since the 
relationship depends on the algorithm and the particular computer system used. 

The total number of operations per frame for a straightforward implementation of the 
prefiltering algorithm is given by 18 FS add/mults + 9 FS com/sw. The corresponding 
number for the detection scheme is 10 FS add/mults. The first number can be reduced 
significantly by exploiting the fact that the difference equations which describe the recursive 
filters can each be implemented with a single multiplication. Using this implementation, the 
total number of operations for the filtering algorithm reduces to approximately 5 FS mults + 
13 FS adds + 9 FS com/sw for each frame. 

Since the detection algorithm and the median operations are parallelizable down to the 
block and even the pixel level, the bottleneck is created by the recursive filtering algorithm. 
Although the four (five) filters can work in parallel, the individual recursive process cannot 
be parallelized in a simple manner. Therefore, the computational speed is dictated by the 
recursive processes and a minimum requirement (assuming a fully parallel implementation) is 
given by approximately 1 FS mults + 3 FS adds per frame. Using only partial parallelization 
(block instead of pixel level) this number is not increased significantly. 

The frame store requirement depends on the degree of parallelism desired and ranges 
between six and ten frames at any given time. 

The above numbers assume black and white sequences and are slightly higher for pro- 
cessing color sequences. 

If one assumes full parallelism, with a frame size of FS = 1 M, and a frame rate of 
30 frames per second, the required speed would be around 30 M multiplications and 90 M 
additions per second, a rate which is within the range of today's top-of-the-line signal 
processors. Motorola's DSP 96002, for example, performs 60 MFlops using a standard IEEE 
floating point format. Most of the operations required in the prefiltering and detection 
scheme require not more than 8 bits per data word. (Some operations are single bit 
manipulations.) A fixed point format suffices for all cases. This illustrates that the required 
arithmetic speed can be reached by current off-the-shelf hardware components. However, a 
throughput rate of 30 Mbytes per second, and frame storage requirement of approximately 
10 Mbyte are beyond the capacities of current signal processors, and customized VLSI designs 
would be necessary. 
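
The figures quoted in this section follow from simple per-pixel arithmetic, reproduced in the sketch below. The function name and its parameters are hypothetical; FS = 1 M is read as 10^6 pixels, one byte per pixel is assumed from the 8-bit word length, and ten stored frames is the upper end of the range given above.

```python
def required_rates(frame_size, frame_rate, mults_per_pixel, adds_per_pixel,
                   bytes_per_pixel=1, stored_frames=10):
    """Per-second operation counts, data rate, and frame store size."""
    pixels_per_second = frame_size * frame_rate
    return {
        "mults_per_s": mults_per_pixel * pixels_per_second,
        "adds_per_s": adds_per_pixel * pixels_per_second,
        "bytes_per_s": bytes_per_pixel * pixels_per_second,
        "frame_store_bytes": stored_frames * frame_size * bytes_per_pixel,
    }

FS = 1_000_000  # assumed meaning of "FS = 1 M"

# Fully parallel case: 1 FS mults + 3 FS adds per frame.
parallel = required_rates(FS, 30, mults_per_pixel=1, adds_per_pixel=3)

# Reduced serial filter count: 5 FS mults + 13 FS adds per frame.
serial = required_rates(FS, 30, mults_per_pixel=5, adds_per_pixel=13)
```

Under these assumptions `parallel` gives 30 M multiplications and 90 M additions per second, a data rate of 30 Mbytes per second, and a 10 Mbyte frame store, matching the numbers in the text; `serial` gives 150 M multiplications and 390 M additions per second.
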


5 Conclusion 


The algorithms presented here have been shown useful as preprocessing tools in prepara- 
tion for compression of video sequences. During the coming year, further research on this 
grant will focus more directly on the compression problems, developing sub-band/transform 
techniques building on progress to date. The focus of filtering problems will shift from the 
preprocessing stage to band-splitting filters for sub-band coders. For more general appli- 
cability, the dynamics detection/estimation algorithm will be modified for adaptivity in bit 
rate allocation, necessary for scene changes and full frame dynamics, which occur in zooming 
and panning. 

Figure 4: Initial input frame, providing the initial conditions for the 3-D filter transient 
response. 

Figure 5: Stationary input frame after the scene change. 

Figure 6: Transient filter response to a scene change: first frame after the transition. 

Figure 7: Transient filter response to a scene change: second frame after the transition. 

Figure 8: Transient filter response to a scene change: 10th frame after the transition. 

Figure 9: Sample frame of a noisy, stationary sequence. White noise, η = 0, σ = 20. 

Figure 10: Sample frame of the stationary filter output sequence, produced by the sequence 
in Fig. 9. 

Figure 11: Noisy filter input signal of a non-stationary image sequence, frame # 15. 

Figure 12: Filter output frame # 15 produced by the input signal of Fig. 11. 

Figure 13: Magnitude gain of the linear 2-D filter module for a = 0.4. 

Figure 14: Magnitude spectrum of a noise image. 

Figure 15: Magnitude spectrum of the output image corresponding to Fig. 14. The output 
is produced by the 2-D nonlinear subfilter. 



[Figure panel title: Robust Change Detection by Blurring: Original Sequence] 

Figure 16: 2-D linear zero-phase filtering (blurring) to achieve robust change detection. 
Illustrated grid delineates blocks for conditional replenishment. Actual image of moving 
circular object, with radius of blurring from linear filter (top). Result of choosing update 
blocks using active pixel count for decision, and no linear preprocessing filtering (bottom). 

Figure 17: Segmentation of 50% of most active blocks from “Walter” sequence. Black pixels 
denote active pixels; gray blocks are transmitted, and white are not. Result using only pixel 
counts in individual blocks (left). Result for statistical estimation of segmentation (right). 
Differences in pixel-level detection are due to different histories, and consequently differing 
reference frames. Following figures have the same shading interpretation. 



Figure 18: Images from above sequence. 

Figure 19: Segmentation of 38th frame from “table tennis” sequence, which occurs during 
camera zooming. Result using only pixel counts in individual blocks (top). Result for 
statistical estimation of segmentation (bottom). 









Figure 20: Images from sequence using 50% of sampling rate. Result using only pixel counts 
in individual blocks (top). Result for statistical estimation of segmentation (bottom). Note 
artifacts due to block size in both images. 